Abstract. Despite much progress over the past decade, current Single Nucleotide Polymorphism 
(SNP) genotyping technologies still offer an insufficient degree of multiplexing when required to 
handle user-selected sets of SNPs. In this paper we propose a new genotyping assay architecture 
combining multiplexed solution-phase single-base extension (SBE) reactions with sequencing by 
hybridization (SBH) using universal DNA arrays such as all k-mer arrays. In addition to PCR 
amplification of genomic DNA, SNP genotyping using SBE/SBH assays involves the following steps: 
(1) Synthesizing primers complementing the genomic sequence immediately preceding SNPs of 
interest; (2) Hybridizing these primers with the genomic DNA; (3) Extending each primer by a 
single base using polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent 
dyes; and finally (4) Hybridizing extended primers to a universal DNA array and determining 
the identity of the bases that extend each primer by hybridization pattern analysis. Under the 
assumption of perfect hybridization, unambiguous genotyping of a set of SNPs requires selecting 
primers upstream of the SNPs such that each primer hybridizes to at least one array probe that 
hybridizes to no other primer that can be extended by a common base. Our contributions include 
a study of multiplexing algorithms for SBE/SBH genotyping assays and preliminary experimental 
results showing the achievable tradeoffs between the number of array probes and primer length on 
one hand and the number of SNPs that can be assayed simultaneously on the other. We prove that 
the problem of selecting a maximum size subset of SNPs that can be unambiguously genotyped in a 
single SBE/SBH assay is NP-hard, and propose efficient heuristics with good practical performance. 
Our heuristics take into account the freedom of selecting primers from both strands of the genomic 
DNA as well as the presence of disjoint allele sets among genotyped SNPs. In addition, our heuristics 
can enforce user-specified redundancy constraints facilitating reliable genotyping in the presence of 
hybridization errors. Simulation results on datasets both randomly generated and extracted from 
the NCBI dbSNP database suggest that the SBE/SBH architecture provides a flexible and cost- 
effective alternative to genotyping assays currently used in the industry, enabling genotyping of up 
to hundreds of thousands of user-specified SNPs per assay. 


1 Introduction 

Now that the completion of the Human Genome Project has provided a blueprint of the DNA present in 
each human cell [15,16], genomics research is focusing on the study of DNA variations that occur 
between individuals, seeking to understand how these variations confer susceptibility to common diseases 
such as diabetes or cancer. The most common form of genomic variation is the so-called single nucleotide 
polymorphisms (SNPs), i.e., the presence of different DNA nucleotides, or alleles, at certain chromosomal 
locations. The vast majority of SNPs are bi-allelic, i.e., only two of the four possible DNA bases are 
observed at the SNP locus. Since human cells contain two copies of each chromosome (with the exception 
of sex chromosomes in males), both SNP alleles may be present in the DNA of an individual. Determining 
the identity of alleles present in a DNA sample at a given set of SNP loci is called SNP genotyping. 

The continuous progress in high-throughput genomic technologies has resulted in numerous SNP genotyping 
platforms combining a variety of allele discrimination techniques (sequencing, direct hybridization, 
primer extension, allele-specific PCR, ligation, cleavage, etc.), detection mechanisms (fluorescence, 
mass spectrometry, etc.) and reaction formats (solution phase, solid support, bead arrays); see, e.g., [17, 
19] for comprehensive reviews. However, current technologies still offer an insufficient degree of multiplexing 
(below 10,000 SNPs per assay) for fully-powered genome wide disease association studies that 
require genotyping of large sets of user-selected SNPs [7]. The highest throughput is currently achieved by 
high-density mapping arrays produced by Affymetrix, which can simultaneously genotype a fixed set of 
about 250,000 manufacturer-selected SNPs per array. Genotyping a user-specified set of SNPs of comparable 
size would require an expensive and time-consuming re-design of array probes as well as a difficult 
re-engineering of the primer-ligation amplification protocol. 

Among technologies that allow genotyping of custom sets of SNPs one of the most successful ones 
is the use of DNA tag arrays [6,11,13,21]. DNA tag arrays consist of a set of DNA strings called tags, 
designed such that each tag hybridizes strongly to its own antitag (Watson-Crick complement), but to no 
other antitag. The flexibility of tag arrays comes from combining solid-phase hybridization with the high 
sensitivity of single-base extension reactions, which has also been used for SNP genotyping in combination 
with MALDI-TOF mass spectrometry [3]. A typical assay based on tag arrays performs SNP genotyping 
using the following steps [5,13]: (1) A set of reporter probes is synthesized by ligating antitags to the 5' end 
of primers complementing the genomic sequence immediately preceding the SNPs of interest. (2) Reporter 
probes are hybridized in solution with the genomic sample. (3) The hybridized 3' (primer) end of reporter 
probes is extended by a single base in a reaction using the polymerase enzyme and dideoxynucleotides 
fluorescently labeled with 4 different dyes. (4) Reporter probes are separated from the template DNA 
and hybridized to a tag array. (5) Finally, fluorescence levels are used to determine the identity of the 
extending dideoxynucleotides. Commercially available tag arrays have between 2,000 and 10,000 tags [1, 
2]. The number of SNPs that can be genotyped per array is typically smaller than the number of tags since 
some of the tags must remain unassigned due to cross-hybridization with the primers [5,22]. Another 
factor limiting the wider use of tag arrays is the relatively high cost of synthesizing the reporter probes, 
which have a typical length of 40 nucleotides. 

In the k-mer array format [9], all $4^k$ DNA probes of length k are spotted or synthesized on the 
solid array substrate (values of k of up to 10 are feasible with current high-density in-situ synthesis 
technologies). This format was originally proposed for performing sequencing by hybridization (SBH), 
which seeks to reconstruct an unknown DNA sequence based on its k-mer spectrum [25]. However, the 
sequence length for which unambiguous reconstruction is possible with high probability is surprisingly 
small [26], and, despite several suggestions for improvement, such as the use of gapped probes [12] and 
pooling of target sequences [14], the SBH scheme has not become practical so far. 

In this paper we propose a new genotyping assay architecture combining multiplexed solution-phase 
single-base extension (SBE) reactions with sequencing by hybridization (SBH) using universal DNA 
arrays such as all k-mer arrays. SNP genotyping using SBE/SBH assays requires the following steps (see 
Figure 1): (1) Synthesizing primers complementing the genomic sequence immediately preceding SNPs 
of interest; (2) Hybridizing primers with the genomic DNA; (3) Extending each primer by a single base 
using polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent dyes; and finally (4) 
Hybridizing extended primers to a universal DNA array and determining the identity of the bases that 
extend each primer by hybridization pattern analysis. 

To the best of our knowledge the combination of the two technologies in the context of SNP genotyping 
has not been explored thus far. The most closely related genotyping assay is the generic Polymerase 
Extension Assay (PEA) recently proposed in [27]. In PEA, short amplicons containing the SNPs of 
interest are hybridized to an all k-mer array of primers that are subsequently extended via single-base 
extension reactions. Hence, in PEA the SBE reactions take place on solid support, similar to arrayed 
primer extension (APEX) assays which use SNP specific primers spotted on the array [28]. 

As in [14], the SBE/SBH assay leads to high array probe utilization since we hybridize to the array a 
large number of short extended primers. However, the main power of the method lies in the fact that the 
sequences of the labeled oligonucleotides hybridized to the array are a priori known (up to the identity 
of extending nucleotides). While genotyping with SBE/SBH assays uses similar general principles as 
the PEA assays proposed in [27], there are also significant differences. A major advantage of SBE/SBH 
is the much shorter length of extended primers compared to that of PCR amplicons used in PEA. A 
second advantage is that all probes hybridizing to an extended primer are informative in SBE/SBH 
assays, regardless of array probe length (in contrast, only probes hybridizing with a substring containing 
the SNP site are informative in PEA assays). As shown by the experimental results in Section 4 these 
advantages translate into an increase by orders of magnitude in multiplexing rate compared to the results 
reported in [27]. We further note that PEA’s effectiveness crucially depends on the ability to amplify 
very short (preferably 40bp or less) genomic fragments spanning the SNP loci of interest. This limits 
the achievable degree of multiplexing in PCR amplification [18], making PCR amplification the main 
bottleneck for PEA assays. Full flexibility in picking PCR primers is preserved in SBE/SBH assays. 

Fig. 1. SBE/SBH assay: (a) Primers complementing genomic sequence upstream of each SNP locus are mixed in 
solution with the genomic DNA sample. (b) Temperature is lowered, allowing primers to hybridize to the genomic 
DNA. (c) Polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent dyes are added to the 
solution, causing each primer to be extended by a nucleotide complementing the SNP allele. (d) Extended primers 
are hybridized to a universal DNA array (an all k-mer array for k=2 is shown) and SNP genotypes are determined 
by analyzing the resulting hybridization pattern. Under the assumption of perfect hybridization, unambiguous 
genotyping of the SNPs requires that each primer hybridizes to at least one array probe that hybridizes to no 
other primer that can be extended by a common base. 


The rest of the paper is organized as follows. In Section 2 we formalize two problems that arise in 
genotyping large sets of SNPs using SBE/SBH assays: the problem of partitioning a set of SNPs into the 
minimum number of “decodable” subsets, i.e., subsets of SNPs that can be unambiguously genotyped 
using a single SBE/SBH assay, and that of finding a maximum decodable subset of a given set of SNPs. We 
also establish hardness results for the latter problem. In Section 3 we propose several efficient heuristics. 
Finally, in Section 4 we present experimental results on both randomly generated datasets and instances 
extracted from the NCBI dbSNP database, exploring achievable tradeoffs between the type/number of 
array probes and primer length on one hand and number of SNPs that can be assayed per array on the 
other. Our results suggest that the SBE/SBH architecture provides a flexible and cost-effective alternative 
to genotyping assays currently used in the industry, enabling genotyping of up to hundreds of thousands 
of user-selected SNPs per assay. 

2 Problem Formulations and Complexity 

A set of SNP loci can be unambiguously genotyped by SBE/SBH if every combination of SNP genotypes 
yields a different hybridization pattern (defined as the vector of dye colors observed at each array probe). 
To formalize the requirements of unambiguous genotyping, let us first consider a simplified SBE/SBH 
assay consisting of four parallel single-color SBE/SBH reactions, one for each possible SNP allele. Under 
this scenario, only one type of dideoxynucleotide is added to each SBE reaction, corresponding to the 
complement of the tested SNP allele. Therefore, a primer is extended in such a reaction if the tested allele 
is present at the SNP locus probed by the primer, and is left un-extended otherwise. 



























Let $\mathcal{P}$ be the set of primers used in a single-color SBE/SBH reaction involving dideoxynucleotide 
$e \in \{A,C,G,T\}$. From the resulting hybridization pattern we must be able to infer for every $p \in \mathcal{P}$ 
whether or not p was extended by e. The extension of p by e will result in a fluorescent signal at all array 
probes that hybridize with pe. However, some of these probes can give a fluorescent signal even when p 
is not extended by e, due to hybridization to other extended primers. Since in the worst case all other 
primers are extended, it must be the case that at least one of the probes that hybridize to pe does not 
hybridize to any other extended primer. 

Formally, let $X \subset \{A,C,G,T\}^*$ be the set of array probes. For every string $y \in \{A,C,G,T\}^*$, let 
the spectrum of y in X, denoted $\mathrm{Spec}_X(y)$, be the set of probes of X that hybridize with y. Under 
the assumption of perfect hybridization, $\mathrm{Spec}_X(y)$ consists of those probes of X that are Watson-Crick 
complements of substrings of y. Then, a set of primers $\mathcal{P}$ is said to be decodable with respect to extension 
e if and only if, for every $p \in \mathcal{P}$, 

$$\mathrm{Spec}_X(pe) \setminus \bigcup_{p' \in \mathcal{P} \setminus \{p\}} \mathrm{Spec}_X(p'e) \neq \emptyset \tag{1}$$
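
As an illustration only, constraint (1) can be checked directly from the definition of the spectrum. The Python sketch below is not taken from the paper: the function names are hypothetical, and it assumes that all sequences are written 5' to 3', so that "Watson-Crick complement" is implemented as the reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    # Assumed convention: sequences are written 5'->3', so a probe pairs with
    # the reverse complement of the substring it binds.
    return s.translate(COMPLEMENT)[::-1]

def spectrum(y, probes):
    """Spec_X(y): probes complementary to some substring of y (perfect hybridization)."""
    hits = set()
    for k in {len(x) for x in probes}:
        for i in range(len(y) - k + 1):
            x = revcomp(y[i:i + k])
            if x in probes:
                hits.add(x)
    return hits

def decodable_single_color(primers, e, probes):
    """Constraint (1): every extended primer pe keeps a probe that no other p'e hits."""
    spectra = {p: spectrum(p + e, probes) for p in primers}
    for p in primers:
        others = set().union(*(spectra[q] for q in primers if q != p))
        if not spectra[p] - others:
            return False
    return True

# All 2-mers, matching the k = 2 array drawn in Figure 1.
ALL_2MERS = {a + b for a in "ACGT" for b in "ACGT"}
```

For example, with the all 2-mer array the primers AC and TT are decodable with respect to extension G, whereas ACG and CG are not decodable with respect to extension T, because every probe hit by CGT is also hit by ACGT.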

Decoding constraints (1) can be directly extended to 4-color SBE/SBH experiments, in which each 
type of extending base is labeled by a different fluorescent dye. As before, let $\mathcal{P}$ be the set of primers, 
and, for each primer $p \in \mathcal{P}$, let $E_p \subseteq \{A,C,G,T\}$ be the set of possible extensions of p, i.e., Watson-Crick 
complements of corresponding SNP alleles. If we assume that any combination of dyes can be detected 
at an array probe location, unambiguous decoding is guaranteed if, for every $p \in \mathcal{P}$ and every extending 
nucleotide $e \in E_p$, 

$$\mathrm{Spec}_X(pe) \setminus \bigcup_{p' \in \mathcal{P} \setminus \{p\},\; e \in E_{p'}} \mathrm{Spec}_X(p'e) \neq \emptyset \tag{2}$$

In the following, we refine (2) to improve practical reliability of SBE/SBH assays. More precisely, we 
impose additional constraints on the set of probes considered to be informative for each SNP allele. First, 
to enable reliable genotyping of genomic samples that contain SNP alleles at very different concentrations 
(as a result of uneven efficiency in the PCR amplification step or of pooling DNA from different 
individuals), we require that a probe that is informative for a certain SNP locus must not hybridize to 
primers corresponding to different SNP loci, regardless of their extension. Second, since recent studies by 
Naef et al. [23] suggest that fluorescent dyes can significantly interfere with oligonucleotide hybridization 
on solid support, possibly destabilizing hybridization to a complementary probe on the array, in this 
paper we use a conservative approach and require that each probe that is informative for a certain SNP 
allele must hybridize to a strict substring of the corresponding primer. On the other hand, informative 
probes are still required not to hybridize with any other extended primer, even if such hybridizations 
involve fluorescently labeled nucleotides. Finally, we introduce a decoding redundancy parameter $r \geq 1$, and 
require that each SNP have at least r informative probes, i.e., probes that hybridize to the corresponding 
primer but do not hybridize to any other extended primer. Such a redundancy constraint facilitates 
reliable genotype calling in the presence of hybridization errors. Clearly, the larger the value of r, the 
more hybridization errors can be tolerated. If a simple majority voting scheme is used for making 
allele calls, the assay can tolerate up to $\lfloor r/2 \rfloor$ hybridization errors involving the r informative probes of 
each SNP. Furthermore, since the informative probes of a SNP are required to hybridize exclusively with 
the primer corresponding to the SNP, the redundancy requirement provides a powerful mechanism for 
detecting and gauging the extent of hybridization errors. Indeed, each unintended hybridization at an 
informative probe for a bi-allelic SNP has a dye complementary to one of the SNP alleles with probability 
of only 1/2, and the probability that k such errors pass undetected decreases exponentially in k. 

The refined set of constraints is captured by the following definition, where, for every primer 
$p \in \{A,C,G,T\}^*$ and set of extensions $E \subseteq \{A,C,G,T\}$, we let 

$$\mathrm{Spec}_X(p,E) = \bigcup_{e \in E} \mathrm{Spec}_X(pe)$$

Definition 1. A set of primers $\mathcal{P}$ is said to be strongly r-decodable with respect to extension sets $E_p$, 
$p \in \mathcal{P}$, if and only if, for every $p \in \mathcal{P}$, 

$$\left| \mathrm{Spec}_X(p) \setminus \bigcup_{p' \in \mathcal{P} \setminus \{p\}} \mathrm{Spec}_X(p', E_{p'}) \right| \geq r \tag{3}$$



Note that testing whether or not a given set of primers is strongly r-decodable can be easily accomplished 
in time linear in the total length of the primers. 
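
A minimal sketch of such a test is given below, under stated assumptions: probes are stored in a hash set, the primers in the list are distinct, every extension set is non-empty, and Watson-Crick complementarity is implemented as the reverse complement. The function names are hypothetical. A single pass counts, for every probe, how many primers hit it through their extended spectra; a probe of $\mathrm{Spec}_X(p)$ is informative exactly when that count equals 1.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    # Assumed convention: sequences written 5'->3'.
    return s.translate(COMPLEMENT)[::-1]

def spectrum(y, probes):
    """Spec_X(y) under perfect hybridization."""
    return {revcomp(y[i:i + k])
            for k in {len(x) for x in probes}
            for i in range(len(y) - k + 1)
            if revcomp(y[i:i + k]) in probes}

def extended_spectrum(p, extensions, probes):
    """Spec_X(p, E): union of Spec_X(pe) over all e in E."""
    return set().union(*(spectrum(p + e, probes) for e in extensions))

def strongly_r_decodable(primers, ext, probes, r=1):
    """Check condition (3) for distinct primers; ext maps primer -> extension bases."""
    hit_count = Counter()
    for p in primers:
        hit_count.update(extended_spectrum(p, ext[p], probes))
    # A probe hit by p alone has count 1, since Spec_X(p) is contained in Spec_X(p, E_p).
    return all(
        sum(hit_count[x] == 1 for x in spectrum(p, probes)) >= r
        for p in primers
    )

ALL_3MERS = {"".join(t) for t in product("ACGT", repeat=3)}
```

With the all 3-mer array, primers ACGT and CCGT (both extendable by A) share the probes ACG and TAC, so each keeps a single informative probe: the pair is strongly 1-decodable but not strongly 2-decodable.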

Genotyping a large set of SNPs will, in general, require more than one SBE/SBH assay. This raises 
the problem of partitioning a given set of SNPs into the smallest number of subsets that can each be 
genotyped using a single SBE/SBH assay. For each SNP locus there are typically two different primers 
that can be used for genotyping. As shown in [22] for the case of SNP genotyping using tag arrays, 
exploiting this degree of freedom significantly increases achievable multiplexing rates. Therefore, we next 
extend our definitions to capture this degree of freedom. Let $P_i$ be the pool of primers that can be used 
to genotype the SNP at locus i. Similarly to Definition 1, we have: 

Definition 2. A set of primer pools $\mathcal{P} = \{P_1,\ldots,P_n\}$ is said to be strongly r-decodable if and only 
if there is a primer $p_i$ in each pool $P_i$ such that $\{p_1,\ldots,p_n\}$ is strongly r-decodable with respect to the 
respective extension sets $E_{p_i}$, $i = 1,\ldots,n$. 

Primers $p_1, p_2, \ldots, p_n$ above are called the representative primers of pools $P_1, P_2, \ldots, P_n$, respectively. 
The SNP partitioning problem can then be formulated as follows: 

Minimum Pool Partitioning Problem (MPPP): Given primer pools $\mathcal{P} = \{P_1,\ldots,P_n\}$, associated 
extension sets $E_p$, $p \in \bigcup_{i=1}^{n} P_i$, probe set X, and redundancy r, find a partitioning of $\mathcal{P}$ into the minimum 
number of strongly r-decodable subsets. 

A natural strategy for solving MPPP, similar to the well-known greedy algorithm for the set cover 
problem, is to find a maximum strongly r-decodable subset of pools, remove it from $\mathcal{P}$, and then repeat 
the procedure until no more pools are left in $\mathcal{P}$. This greedy strategy for solving MPPP has been shown to 
empirically outperform other algorithms for solving the similar partitioning problem for PEA assays [27]. 
In the case of SBE/SBH, the optimization involved in the main step of the greedy strategy is formalized 
as follows: 

Maximum r-Decodable Pool Subset Problem (MDPSP): Given primer pools $\mathcal{P} = \{P_1,\ldots,P_n\}$, 
associated extension sets $E_p$, $p \in \bigcup_{i=1}^{n} P_i$, probe set X, and redundancy r, find a strongly r-decodable 
subset $\mathcal{P}' \subseteq \mathcal{P}$ of maximum size. In addition, for each pool $P_i \in \mathcal{P}'$, find its representative primer. 
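
The greedy strategy for MPPP described above can be sketched as a short loop around any MDPSP heuristic. The sketch below is illustrative only: `greedy_partition` and the `find_decodable_subset` callback are hypothetical names, and the callback is assumed to return a non-empty decodable sub-list of the pools it is given whenever any pool can be genotyped at all.

```python
def greedy_partition(pools, find_decodable_subset):
    """Split `pools` (hashable pool identifiers) into assays, one decodable subset at a time.

    find_decodable_subset(remaining) is assumed to return a list of pools from
    `remaining` that can be genotyped together in a single SBE/SBH assay.
    """
    remaining = list(pools)
    assays = []
    while remaining:
        chosen = find_decodable_subset(remaining)
        if not chosen:
            # No remaining pool is decodable even on its own with this probe set.
            raise ValueError("some pools cannot be genotyped with the given probe set")
        assays.append(list(chosen))
        chosen_set = set(chosen)
        remaining = [pool for pool in remaining if pool not in chosen_set]
    return assays
```

The number of assays produced is the length of the returned list; the quality of the partition depends entirely on how close the callback comes to a maximum decodable subset at each round.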

Unfortunately, as shown in the next theorem, MDPSP is NP-hard even for the case when the redundancy 
parameter is 1 and each pool has exactly one primer. 

Theorem 1. MDPSP is NP-hard, even when restricted to instances with r = 1 and $|P| = 1$ for every 
$P \in \mathcal{P}$. 

Proof. We will use a reduction from the maximum induced matching problem in bipartite graphs, which 
is defined as follows: 

Maximum Induced Matching (MIM) Problem in Bipartite Graphs: Given a bipartite graph 
$G = (U \cup V, E)$, find maximum size subsets $U' \subseteq U$, $V' \subseteq V$, with $|U'| = |V'|$, such that the subgraph of 
G induced by $U' \cup V'$ is a matching. 

The MIM problem in bipartite graphs is known to be NP-hard even for graphs with maximum degree 
3 [20]. Let $G = (U \cup V, E)$ be such a bipartite graph with maximum degree 3. Without loss of generality 
we may assume that every vertex in G has degree at least 1. We will denote by $N(u)$ the neighborhood of 
vertex $u \in U \cup V$, i.e., the set of vertices adjacent with u in G. 

We construct an instance of MDPSP as follows: Let r = 1 and $l = \lceil \log_2 |V| \rceil$. For every $v \in V$ we add 
to X a distinct probe $x_v \in \{A,T\}^l$; note that this can be done since $|\{A,T\}^l| = 2^l \geq |V|$ by our choice of 
l. For every $u \in U$ with neighborhood $N(u) = \{v_1, v_2, v_3\}$, we construct a primer $p_u = x_{v_1} C x_{v_2} C x_{v_3}$ and 
set $P_u = \{p_u\}$. We use a similar construction for vertices $u \in U$ with only 1 or 2 neighbors. Note that in 
each case the pool $P_u$ consists of a single primer of length at most $3l + 2$. For each constructed primer 
p, the set of possible extensions is defined as $E_p = \{G,C\}$. Since the probes of X contain only A's and 
T's, for every primer $p_u$, $u \in U$, 

$$\mathrm{Spec}_X(p_u, E_{p_u}) = \mathrm{Spec}_X(p_u) = \{x_v \in X \mid v \in N(u)\}$$



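Input: Pools $\mathcal{P} = \{P_1,\ldots,P_n\}$, extension sets $E_p$, $p \in \bigcup_{i=1}^{n} P_i$, probe set X, and redundancy r 

Output: Strongly r-decodable subset of pools $\mathcal{P}' \subseteq \mathcal{P}$ and set R of representative primers for the pools in $\mathcal{P}'$ 

0. $\mathcal{P}' \leftarrow \emptyset$, $R \leftarrow \emptyset$ 
1. For each $P \in \mathcal{P}$ do 
2.     For each $p \in P$ do 
3.         If $R \cup \{p\}$ satisfies (3) Then 
4.             $\mathcal{P}' \leftarrow \mathcal{P}' \cup \{P\}$ 
               $R \leftarrow R \cup \{p\}$ 
5.             Exit inner For 
           End If 
       End For 
   End For 


Fig. 2. The Sequential Greedy algorithm. 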


Let $U' \subseteq U$, $V' \subseteq V$, $|U'| = |V'|$, be subsets of vertices such that $U' \cup V'$ induces a matching in 
G. Let $\mathcal{P}' = \{P_u \mid u \in U'\}$. For every $u \in U'$, exactly one of u's neighbors, denoted $v_u$, appears in $V'$, 
because $U' \cup V'$ induces a matching. Furthermore, for each $u' \in U' \setminus \{u\}$, $(u', v_u) \notin E$, and therefore 
$x_{v_u} \notin \mathrm{Spec}_X(p_{u'}, E_{p_{u'}})$. Thus, for every $u \in U'$, 

$$x_{v_u} \in \mathrm{Spec}_X(p_u) \setminus \bigcup_{\{p_{u'}\} \in \mathcal{P}' \setminus \{P_u\}} \mathrm{Spec}_X(p_{u'}, E_{p_{u'}})$$

which means that $\mathcal{P}'$ is a strongly 1-decodable subset of pools of the same size as the induced matching 
of G. 

Conversely, let $\mathcal{P}'$ be a strongly 1-decodable subset of $\mathcal{P}$, and let $U' = \{u \in U \mid \{p_u\} \in \mathcal{P}'\}$. Since $\mathcal{P}'$ is 
1-decodable, for every primer $p_u$ with $\{p_u\} \in \mathcal{P}'$ there must exist a probe $x \in X$ such that $x \in \mathrm{Spec}_X(p_u)$ 
and $x \notin \mathrm{Spec}_X(p_{u'}, E_{p_{u'}})$ for every $\{p_{u'}\} \in \mathcal{P}' \setminus \{P_u\}$. Because $\mathrm{Spec}_X(p_u) = \{x_v \in X \mid v \in N(u)\}$, it 
follows that every vertex $u \in U'$ has a neighbor $v \in V$ that is not a neighbor of any other $u' \in U' \setminus \{u\}$. 
Let $v_u$ be such a neighbor (pick $v_u$ arbitrarily if more than one vertex in V satisfies the above property), and 
let $V' = \{v_u \mid u \in U'\}$. It is clear that $U' \cup V'$ induces a matching of size $|\mathcal{P}'|$ in G. 

Thus, for every integer k, there is a one-to-one correspondence between induced matchings of size k in 
G and strongly 1-decodable subsets of k pools in the constructed instance of MDPSP, and NP-hardness 
of MDPSP follows. 
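
The construction used in the proof can be reproduced on small graphs. The Python sketch below is an illustration under explicit assumptions, not the paper's own code: hybridization is modeled with the reverse complement, each primer block is therefore taken to be the reverse complement of the corresponding probe so that the probe $x_v$ itself appears in the spectrum, and the probe length is forced to be at least 1. The function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    # Assumed convention: sequences written 5'->3'.
    return s.translate(COMPLEMENT)[::-1]

def spectrum(y, probes):
    """Spec_X(y) under perfect hybridization."""
    return {revcomp(y[i:i + k])
            for k in {len(x) for x in probes}
            for i in range(len(y) - k + 1)
            if revcomp(y[i:i + k]) in probes}

def build_mdpsp_instance(adj_u, v_vertices):
    """adj_u maps each u in U to its list of 1 to 3 neighbors in V."""
    # Integer form of ceil(log2 |V|), forced to be at least 1 (an assumption).
    l = max(1, (len(v_vertices) - 1).bit_length())
    words = ("".join(w) for w in product("AT", repeat=l))
    probe_of = {v: next(words) for v in v_vertices}       # distinct x_v in {A,T}^l
    primer_of = {u: "C".join(revcomp(probe_of[v]) for v in nbrs)
                 for u, nbrs in adj_u.items()}            # blocks separated by C
    return probe_of, primer_of
```

Because every window of a primer that spans a separator C, or the extension base G or C, has a complement containing G or C, only whole blocks can match probes over {A,T}; the spectrum of each extended primer is therefore exactly the set of probes of its neighbors.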

The reduction in the proof of Theorem 1 preserves the size of the optimal solution, and therefore 
any hardness of approximation result for the MIM in bipartite graphs will also hold for MDPSP, even 
when restricted to instances with r = 1 and $|P| = 1$ for every $P \in \mathcal{P}$. Since Duckworth et al. [10] proved 
that it is NP-hard to approximate MIM in bipartite graphs with maximum degree 3 within a factor of 
6600/6659, we get: 

Theorem 2. It is NP-hard to approximate MDPSP within a factor of 6600/6659, even when restricted 
to instances with r = 1 and $|P| = 1$ for every $P \in \mathcal{P}$. 

3 Algorithms 

In this section we describe three heuristic approaches to MDPSP. The first one is a naive greedy algorithm 
that sequentially evaluates the primers in the given pools in an arbitrary order. The algorithm picks a 
primer p to be the representative of pool $P \in \mathcal{P}$ if p together with the representatives already picked 
satisfy condition (3). The pseudocode of this algorithm, which we refer to as Sequential Greedy, is given 
in Figure 2. 
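
A compact Python rendering of Sequential Greedy is sketched below for illustration. It assumes reverse-complement hybridization, distinct primer sequences, and non-empty extension sets, and it re-tests condition (3) from scratch for every candidate, which is simple but not the most efficient choice. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(COMPLEMENT)[::-1]      # assumed 5'->3' convention

def spectrum(y, probes):
    return {revcomp(y[i:i + k])
            for k in {len(x) for x in probes}
            for i in range(len(y) - k + 1)
            if revcomp(y[i:i + k]) in probes}

def satisfies_3(primers, ext, probes, r):
    """Condition (3) for a list of distinct primers."""
    hits = Counter()
    for p in primers:
        hits.update(set().union(*(spectrum(p + e, probes) for e in ext[p])))
    return all(sum(hits[x] == 1 for x in spectrum(p, probes)) >= r for p in primers)

def sequential_greedy(pools, ext, probes, r=1):
    """pools: list of lists of primers. Returns (indices of chosen pools, representatives)."""
    chosen_pools, reps = [], []
    for idx, pool in enumerate(pools):
        for p in pool:
            if satisfies_3(reps + [p], ext, probes, r):
                chosen_pools.append(idx)
                reps.append(p)
                break                          # one representative per pool
    return chosen_pools, reps

ALL_3MERS = {"".join(t) for t in product("ACGT", repeat=3)}
```

On the all 3-mer array with pools [ACGT], [CCGT, TTGCA], [CGTA], redundancy r = 2 forces the second pool to fall back on its second primer TTGCA, and the third pool is rejected because it would leave ACGT with a single informative probe.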

The next two algorithms are inspired by the Min-Greedy algorithm in [10], which approximates MIM 
in d-regular graphs within a factor of $d - 1$. For the MIM problem, the Min-Greedy algorithm picks at 





remove-primer(p) 

Begin 
    For all $x \in N^+(p)$ do 
        $N^+(x) \leftarrow N^+(x) \setminus \{p\}$ 
        If $|N^+(x)| = 0$ Then 
            remove-probe(x) 
        End If 
    End For 
    For all $x \in N^-(p)$ do 
        $N^-(x) \leftarrow N^-(x) \setminus \{p\}$ 
    End For 
    Delete vertex p from graph G 
End 


Fig. 3. The remove-primer subroutine. 


each step a vertex u of minimum degree and a vertex v, which is a minimum degree neighbor of u. All 
the neighbors of u and v are deleted and the edge (u, v) is added to the induced matching. The algorithm 
stops when the graph becomes empty. 
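
The Min-Greedy rule just described can be sketched in a few lines of Python. This is an illustrative reading rather than the implementation from [10]: the graph is given as an adjacency dictionary, ties are broken by iteration order, and vertices that become isolated are simply discarded (an assumption needed so that the loop terminates on arbitrary graphs).

```python
def min_greedy_induced_matching(adj):
    """adj: dict vertex -> set of neighbors (undirected, symmetric). Returns matched edges."""
    g = {u: set(vs) for u, vs in adj.items()}   # working copy
    matching = []

    def delete(w):
        # Remove w and all its incident edges.
        for z in g.pop(w):
            g[z].discard(w)

    while g:
        u = min(g, key=lambda w: len(g[w]))     # minimum degree vertex
        if not g[u]:
            delete(u)                           # isolated vertex: cannot be matched
            continue
        v = min(g[u], key=lambda w: len(g[w]))  # minimum degree neighbor of u
        matching.append((u, v))
        # g[u] contains v and g[v] contains u, so the union covers u, v,
        # and all of their neighbors.
        for w in g[u] | g[v]:
            delete(w)
    return matching
```

Deleting both endpoints together with all of their neighbors guarantees that no later edge can touch a vertex adjacent to an earlier one, which is exactly the induced-matching property.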

Each instance of MDPSP can be represented as a bipartite hybridization graph $G = ((\bigcup_{i=1}^{n} P_i) \cup X, E)$, 
with the left side containing all primers in the given pools and the right side containing the array probes, 
i.e., X. There is an edge between primer p and probe $x \in X$ iff $x \in \mathrm{Spec}_X(p, E_p)$. As discussed in Section 2, 
we need to distinguish between the hybridizations that involve fluorescently labeled nucleotides and those 
that do not. Thus, for every primer p, we let $N^+(p) = \mathrm{Spec}_X(p)$ and $N^-(p) = \mathrm{Spec}_X(p, E_p) \setminus \mathrm{Spec}_X(p)$. 
Similarly, for each probe $x \in X$, we let $N^+(x) = \{p \mid x \in N^+(p)\}$ and $N^-(x) = \{p \mid x \in N^-(p)\}$. 
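
The four neighborhood maps can be built in one pass over the primers. The sketch below is illustrative: it assumes reverse-complement hybridization and distinct primer sequences, and the function and variable names are hypothetical.

```python
from collections import defaultdict
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(COMPLEMENT)[::-1]      # assumed 5'->3' convention

def spectrum(y, probes):
    return {revcomp(y[i:i + k])
            for k in {len(x) for x in probes}
            for i in range(len(y) - k + 1)
            if revcomp(y[i:i + k]) in probes}

def build_hybridization_graph(pools, ext, probes):
    """Return (N+ of primers, N- of primers, N+ of probes, N- of probes)."""
    n_plus, n_minus = {}, {}
    x_plus, x_minus = defaultdict(set), defaultdict(set)
    for pool in pools:
        for p in pool:
            plain = spectrum(p, probes)                      # N+(p) = Spec_X(p)
            full = set().union(*(spectrum(p + e, probes) for e in ext[p]))
            n_plus[p], n_minus[p] = plain, full - plain      # N-(p): labeled-base hits only
            for x in n_plus[p]:
                x_plus[x].add(p)
            for x in n_minus[p]:
                x_minus[x].add(p)
    return n_plus, n_minus, x_plus, x_minus

ALL_3MERS = {"".join(t) for t in product("ACGT", repeat=3)}
```

For the single primer ACGT with extension set {A} on the all 3-mer array, the probes CGT and ACG fall in $N^+$, while TAC, which pairs with the window GTA containing the labeled base, falls in $N^-$.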

We considered two versions of the Min-Greedy algorithm when run on the bipartite hybridization 
graph, depending on the side from which the minimum degree vertex is picked. In the first version, 
referred to as MinPrimerGreedy, we first pick a minimum degree node from the primers side, while in the 
second version, referred to as MinProbeGreedy, we first pick a minimum degree node from the probes 
side. Thus, MinPrimerGreedy picks at each step a minimum degree primer p and pairs it with 
a minimum degree probe $x \in N^+(p)$. MinProbeGreedy selects at each step a minimum degree 
probe x and pairs it with a minimum degree primer p in $N^+(x)$. In both algorithms, all neighbors of p 
and x and their incident edges are removed from G. Also, at each step, the algorithms remove all vertices 
u for which $N^+(u) = \emptyset$. These deletions ensure that the primers p selected at each step satisfy condition 
(3). Both algorithms stop when the graph becomes empty. 

As described so far, the MinPrimerGreedy and MinProbeGreedy algorithms work when each pool 
contains only one primer and when the redundancy is 1. We extended the two variants to handle pools of 
size greater than 1 by simply removing from the graph all primers $p' \in P \setminus \{p\}$ when picking primer p from 
pool P. If the redundancy r is greater than 1, then whenever we pick a primer p, we also pick its r probe 
neighbors from $N^+(p)$ with the smallest degrees (breaking ties arbitrarily). The primer neighbors of all 
these r probes will then be deleted from the graph. Moreover, the algorithm maintains the invariant that 
$|N^+(p)| \geq r$ for every primer p and $|N^+(x)| \geq 1$ for every probe x by removing primers/probes for which 
the degree decreases below these bounds. Figures 5 and 6 give the pseudocode for the MinPrimerGreedy 
and MinProbeGreedy algorithms, respectively. For the sake of clarity, they use two subroutines for 
removing a primer vertex and a probe vertex, which are described in Figures 3 and 4, respectively. 
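
The following Python sketch shows one way the MinPrimerGreedy procedure with pools and redundancy could be realized. It is a simplified illustration under several assumptions that are not taken from the paper: the degree of a vertex is taken to be $|N^+| + |N^-|$, ties are broken by name, the minimum is found by a linear scan instead of a heap, primers with fewer than r probes in $N^+$ are removed up front, and the selected primer is detached from the graph before its probes are processed. All identifiers are hypothetical.

```python
from collections import defaultdict

def min_primer_greedy(n_plus, n_minus, pool_of, r=1):
    """n_plus, n_minus: dict primer -> set of probes; pool_of: dict primer -> pool id."""
    Np = {p: set(s) for p, s in n_plus.items()}          # working copies
    Nm = {p: set(n_minus.get(p, ())) for p in n_plus}
    Xp, Xm = defaultdict(set), defaultdict(set)
    for p in Np:
        for x in Np[p]:
            Xp[x].add(p)
        for x in Nm[p]:
            Xm[x].add(p)
    alive = set(Xp)                  # probes that can still become informative

    def remove_primer(p):
        if p not in Np:
            return
        plus, minus = Np.pop(p), Nm.pop(p)
        for x in plus:
            Xp[x].discard(p)
            if x in alive and not Xp[x]:
                remove_probe(x)
        for x in minus:
            Xm[x].discard(p)

    def remove_probe(x):
        if x not in alive:
            return
        alive.discard(x)
        for p in list(Xp[x]):
            if p in Np:
                Np[p].discard(x)
                if len(Np[p]) < r:   # invariant |N+(p)| >= r is violated
                    remove_primer(p)
        for p in Xm[x]:
            if p in Nm:
                Nm[p].discard(x)

    for p in list(Np):               # establish the invariant before starting
        if p in Np and len(Np[p]) < r:
            remove_primer(p)

    chosen = []
    while Np:
        p = min(Np, key=lambda q: (len(Np[q]) + len(Nm[q]), q))
        chosen.append(p)
        for q in [q for q in Np if q != p and pool_of[q] == pool_of[p]]:
            remove_primer(q)         # other primers of the same pool
        plus, minus = Np.pop(p), Nm.pop(p)
        for x in plus:
            Xp[x].discard(p)
        for x in minus:
            Xm[x].discard(p)
        ranked = sorted(plus, key=lambda x: (len(Xp[x]) + len(Xm[x]), x))
        alive.difference_update(ranked[:r])   # reserve r informative probes for p
        for x in ranked[:r]:
            for q in list(Xp[x] | Xm[x]):
                remove_primer(q)
        for x in ranked[r:] + sorted(minus):
            remove_probe(x)          # probes lit by p are useless to everyone else
    return chosen
```

The returned list contains the representative primers; the chosen pools follow from `pool_of`. Swapping the roles of the two sides in the selection step would give a corresponding sketch of MinProbeGreedy.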

Algorithms MinPrimerGreedy and MinProbeGreedy can be implemented efficiently using a Fibonacci 
heap for maintaining the degrees of primers, respectively of probes. Let N be the total number of primers 
in the n pools, m be the number of probes in X, and k be the size of the r-decodable set returned by the 
algorithm. Since each primer has bounded degree, the sorting of probe degrees requires $O(k)$ total time. 
The total number of edges in the hybridization graph is $O(N + m)$. By using a Fibonacci heap, finding a 
minimum degree primer (probe) can be done in $O(\log N)$ (respectively $O(\log m)$) time and each primer degree 
update can be done in amortized $O(1)$ time. Thus, the total runtime of the MinPrimerGreedy algorithm is 
$O(k \log N + N + m)$, and the total runtime of the MinProbeGreedy algorithm is $O(k \log m + N + m)$. 




remove-probe(x) 

Begin 
    For all $p \in N^+(x)$ do 
        $N^+(p) \leftarrow N^+(p) \setminus \{x\}$ 
        If $|N^+(p)| < r$ Then 
            remove-primer(p) 
        End If 
    End For 
    For all $p \in N^-(x)$ do 
        $N^-(p) \leftarrow N^-(p) \setminus \{x\}$ 
    End For 
    Delete vertex x from graph G 
End 


Fig. 4. The remove-probe subroutine. 


4 Experimental Results 

We considered two types of data sets: 

- Randomly generated datasets containing between 1,000 and 200,000 pools with 1 or 2 primers of length 
between 10 and 30. 

- Two-primer pools representing over 9 million reference SNPs in human chromosomes 1-22, X, and 
Y extracted from the NCBI dbSNP database build 125. We disregarded reference SNPs for which 
available flanking sequence was insufficient for determining two non-degenerate primers of desired 
length (due, e.g., to the presence of degenerate bases near the SNP locus). 

We used two types of array probe sets. First, we used probe sets containing all k-mers, for k between 
8 and 10. All k-mer arrays are well studied in the context of sequencing by hybridization. However, a 
major drawback of all k-mer arrays is that the k-mers have a wide range of melting temperatures, making 
it difficult to ensure reliable hybridization results. For short oligonucleotides, a good approximation of 
the melting temperature is obtained using the simple 2-4 rule of Wallace [29], according to which the 
melting temperature of a probe is approximately twice the number of A and T bases, plus four times 
the number of C and G bases. As in [4], we define the weight of a DNA string to be the number of A 
and T bases plus twice the number of C and G bases. For a given integer c, a DNA string is called a 
c-token if it has a weight c or more and all its proper suffixes have weight strictly less than c. Since the 
weight of a c-token is either c or c + 1, it follows that the 2-4 rule computed melting temperature of all 
c-tokens varies in a range of about 4°C. In our experiments we used probe sets consisting of all c-tokens, 
with c varying between 11 and 13. The considered values of k and c were picked such that the resulting 
number of probes is representative of current array manufacturing technologies: there are roughly 65,000 
8-mers, 262,000 9-mers, 1 million 10-mers, 86,000 11-tokens, 236,000 12-tokens, and 645,000 13-tokens; 
the smaller probe sets can be spotted using current oligonucleotide printing robots, while the larger 
probe sets can be synthesized in situ using photolithographic techniques. 
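
The definitions of weight and c-token translate directly into a short enumeration, sketched below for illustration (the function names are hypothetical). Tokens are generated by growing strings to the left: a string of weight below c is extended by one base at its front, and it is emitted as a token as soon as its weight reaches c, so every proper suffix of an emitted string has weight below c by construction.

```python
def wallace_tm(s):
    """2-4 rule: 2 degrees per A/T base plus 4 degrees per C/G base."""
    return 2 * sum(b in "AT" for b in s) + 4 * sum(b in "CG" for b in s)

def weight(s):
    """Number of A/T bases plus twice the number of C/G bases."""
    return sum(1 if b in "AT" else 2 for b in s)

def c_tokens(c):
    """All strings of weight >= c whose proper suffixes all have weight < c."""
    tokens = []
    stack = [""]                       # every string on the stack has weight < c
    while stack:
        suffix = stack.pop()
        for base in "ACGT":
            s = base + suffix
            if weight(s) >= c:
                tokens.append(s)       # first time the weight reaches c
            else:
                stack.append(s)        # still too light: keep extending to the left
    return tokens
```

Calling `len(c_tokens(11))` gives 86,464, in line with the figure of roughly 86,000 11-tokens quoted above, and every token produced has weight c or c + 1.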


4.1 Results on Synthetic Datasets 

In a first set of experiments on the randomly generated datasets we compared the three MDPSP algorithms 
on instances with primer length set to 20, which is the typical length used, e.g., in genotyping 
using tag arrays. In these experiments the set of possible extensions was considered to be {A,C,G,T} for 
all primers. Such a conservative choice gives an estimate of multiplexing rates achievable by SBE/SBH 
assays in more demanding genomic analyses such as microorganism identification by DNA barcoding [8], 
in which a primer (typically referred to as a distinguisher in this context) may be extended by any of the 
DNA bases in different microorganisms. The results of these experiments for all k-mer and all c-token 
probe sets are presented in Tables 1 and 2, respectively. The results show that using the flexibility of 




Input: Pools $\mathcal{P} = \{P_1,\ldots,P_n\}$, extension sets $E_p$, $p \in \bigcup_{i=1}^{n} P_i$, probe set X, and redundancy r 

Output: Strongly r-decodable subset of pools $\mathcal{P}' \subseteq \mathcal{P}$ and set R of representative primers for the pools in $\mathcal{P}'$ 

Construct hybridization graph G 
$\mathcal{P}' \leftarrow \emptyset$, $R \leftarrow \emptyset$ 
While G is not empty do 
    Find a minimum degree primer p, and let P be the pool of p 
    $\mathcal{P}' \leftarrow \mathcal{P}' \cup \{P\}$ 
    $R \leftarrow R \cup \{p\}$ 
    For each $p' \in P \setminus \{p\}$ do 
        remove-primer(p') 
    End For 
    Let $|N^+(p)| = k$ and let $\{x_1,\ldots,x_k\}$ be the probes in $N^+(p)$, indexed in increasing order of their degrees 
    For each $x \in \{x_1,\ldots,x_r\}$ do 
        For each $p' \in N^+(x) \cup N^-(x)$ do 
            remove-primer(p') 
        End For 
        Delete vertex x from G 
    End For 
    For each $x \in \{x_{r+1},\ldots,x_k\} \cup N^-(p)$ do 
        remove-probe(x) 
    End For 
End While 


Fig. 5. The MinPrimerGreedy algorithm. 


picking primers from either strand of the genomic sequence yields an improvement of up to 10% in the
number of r-decodable pools. The MinProbeGreedy algorithm typically produces better results than
the MinPrimerGreedy variant. On the other hand, neither Sequential Greedy nor MinProbeGreedy
dominates the other algorithms over the whole range of instance parameters: Sequential Greedy generally
gives the better results for k-mer experiments with high redundancy values, while MinProbeGreedy generally
gives better results for k-mer experiments with a large number of pools and low redundancy and for c-token
experiments.
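
To make the minimum-degree greedy strategy of Fig. 5 more concrete, the following Python sketch implements a deliberately simplified variant. It is not the authors' implementation: it ignores the extension sets and the distinction between N^+ and N^-, and simply requires that each selected primer keeps r probes that hybridize with no other selected primer. The function name `min_primer_greedy` and the dictionary-based input format are assumptions made for illustration.

```python
def min_primer_greedy(pools, hybridizes, r):
    """Simplified minimum-degree greedy selection (illustrative only).

    pools:      dict pool id -> list of primer ids
    hybridizes: dict primer id -> set of probe ids the primer hybridizes with
    r:          number of dedicated probes required per selected primer
    Returns a dict pool id -> (representative primer, list of r dedicated probes).
    """
    pool_of = {p: pid for pid, primers in pools.items() for p in primers}
    probes_of = {p: set(hybridizes.get(p, ())) for p in pool_of}
    primers_of = {}
    for p, probes in probes_of.items():
        for x in probes:
            primers_of.setdefault(x, set()).add(p)

    def remove_primer(p):
        for x in probes_of.pop(p, ()):
            primers_of[x].discard(p)

    def remove_probe(x):
        for p in primers_of.pop(x, ()):
            probes_of[p].discard(x)

    chosen = {}
    while True:
        # Primers with fewer than r remaining probes can no longer be decoded.
        for p in [q for q, probes in probes_of.items() if len(probes) < r]:
            remove_primer(p)
        if not probes_of:
            break
        # Minimum-degree primer; ties are broken by primer id.
        p = min(probes_of, key=lambda q: (len(probes_of[q]), q))
        ranked = sorted(probes_of[p], key=lambda x: (len(primers_of[x]), x))
        dedicated, others = ranked[:r], ranked[r:]
        chosen[pool_of[p]] = (p, dedicated)
        # The other primers of the same pool are no longer needed.
        for q in pools[pool_of[p]]:
            if q != p:
                remove_primer(q)
        # No other primer may hybridize with the probes dedicated to p.
        for x in dedicated:
            for q in list(primers_of[x]):
                remove_primer(q)  # this also removes p itself
            del primers_of[x]
        # The remaining probes of p are unusable for later primers.
        for x in others:
            remove_probe(x)
    return chosen


if __name__ == "__main__":
    pools = {"snp1": ["a", "b"], "snp2": ["c"], "snp3": ["d"]}
    hybridizes = {"a": {"x1", "x2"}, "b": {"x2", "x3"},
                  "c": {"x2"}, "d": {"x4", "x5"}}
    print(min_primer_greedy(pools, hybridizes, r=1))
```

In this toy instance with r = 1 the sketch first selects primer c, whose only probe x2 then rules out both primers of pool snp1; this illustrates how a low-degree primer can block other pools.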

In the second set of experiments we ran the three MDPSP algorithms on datasets with the same
primer length of 20, pool size of 2, and with the number of possible extensions of each primer set to 4 as
in DNA-barcoding applications, and to 2 as in SNP genotyping. The results for all k-mer and all c-token
probe sets are given in Tables 3 and 4. The relative performance of the algorithms is similar to that
observed in the first set of experiments. As expected, taking into account the reduced number of possible
extensions increases the size of computed decodable pool subsets, often by more than 5%.
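
As a small worked example of this comparison, the sketch below computes the relative gain obtained when the number of possible extensions drops from 4 to 2. The three value pairs are taken from Table 3 (r = 1, 100,000 pools, all 8-mers); the helper name `percent_gain` is illustrative.

```python
def percent_gain(base: float, improved: float) -> float:
    """Relative improvement of `improved` over `base`, in percent."""
    return 100.0 * (improved - base) / base


# (size with |E_p| = 4, size with |E_p| = 2), from Table 3:
# r = 1, 100,000 pools, all 8-mers.
table3_r1_100k_k8 = {
    "Sequential": (12656, 13813),
    "MinPrimer": (15324, 16551),
    "MinProbe": (15672, 16800),
}

if __name__ == "__main__":
    for name, (four_ext, two_ext) in table3_r1_100k_k8.items():
        print(f"{name}: {percent_gain(four_ext, two_ext):.1f}%")
```

For these three entries the gains are all above 5%.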

In the third set of experiments we explored the degree of freedom given by the primer length. For any 
fixed array probe set and redundancy requirement, we need a minimum primer length to be able to satisfy 
constraints (3). Increasing the primer length beyond this minimum primer length is often beneficial, as it 
increases the number of array probes that hybridize with the primer. However, if primer length increases 
too much, an increasing number of these probes become non-specific, and the multiplexing rate starts to 
decline. Figure 7 gives the tradeoff between primer length and the size of the strongly r-decodable pool 
subsets computed by the three MDPSP algorithms for pools with 2 primers, 2 possible extensions per 
primer and all 10-mers, respectively all 13-tokens, as array probes. We notice that the optimal primer 
length increases with the redundancy parameter. 
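
The effect of primer length on the number of hybridizing probes can be illustrated with a minimal Python sketch. It assumes that, under perfect hybridization, a k-mer probe hybridizes with a primer exactly when it is the reverse complement of a contiguous length-k window of the primer; the primer sequence and function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Watson-Crick reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridizing_kmer_probes(primer: str, k: int) -> set:
    """k-mer probes complementary to some length-k window of the primer."""
    return {reverse_complement(primer[i:i + k])
            for i in range(len(primer) - k + 1)}


if __name__ == "__main__":
    primer = "ACGTTGCAAGCTTAGGCATC"  # hypothetical 20-base primer
    for length in (14, 20):
        probes = hybridizing_kmer_probes(primer[:length], 10)
        # A primer of length L has at most L - k + 1 distinct k-mer windows.
        print(length, len(probes), "at most", length - 10 + 1)
```

A longer primer offers more candidate probes, but each additional probe may also hybridize with other primers, which is the non-specificity effect described above.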

4.2 Results on dbSNP Data 

To stress-test our methods, we extracted a total of over 9 million 2-primer pools corresponding to reference 
SNPs in human chromosomes 1-22, X, and Y in the NCBI dbSNP database build 125. We constructed 
a dataset for each of the 24 chromosomes by creating a 2-primer pool for each reference SNP for which 
dbSNP contains at least 20 non-degenerate base pairs of flanking sequence on both sides (the number of 




Input: Pools 𝒫 = {P_1, ..., P_n}, extension sets E_p, p ∈ ∪_{i=1}^{n} P_i, probe set X, and redundancy r
[...]
    For each x ∈ {x_1, ..., x_r} do
        For each p' ∈ N^+(x) ∪ N^-(x) do
            remove-primer(p')
        End For
        Delete vertex x from G
    End For
    For each x ∈ {x_{r+1}, ..., x_k} ∪ N^-(p) do
        remove-probe(x)
    End For
End While

Fig. 6. MinProbeGreedy algorithm.


reference SNPs and extracted pools for each chromosome are given in Table 5). Since these large sets of 
pools must be partitioned between multiple SBE/SBH experiments, we used a simple MPPP algorithm 
which iteratively finds maximum r-decodable pool subsets using the sequential greedy algorithm. 
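
The iterative partitioning scheme can be sketched as follows. This is a minimal illustration, not the authors' code: `select_decodable_subset` stands for any MDPSP heuristic (for example the sequential greedy algorithm) and is passed in as a hypothetical callback that returns a decodable subset of the pools it is given.

```python
def partition_into_arrays(pools, select_decodable_subset):
    """Assign pools to successive arrays by repeated subset selection.

    pools:                   iterable of hashable pool identifiers
    select_decodable_subset: callable taking the set of remaining pools and
                             returning a subset assayable on one array
    Returns (list of per-array pool sets, set of pools left unassigned).
    """
    remaining = set(pools)
    arrays = []
    while remaining:
        chosen = set(select_decodable_subset(remaining)) & remaining
        if not chosen:
            break  # the selector cannot place any of the remaining pools
        arrays.append(chosen)
        remaining -= chosen
    return arrays, remaining


if __name__ == "__main__":
    # Toy selector: pretend that at most 3 pools fit on one array.
    def toy_selector(candidates):
        return sorted(candidates)[:3]

    arrays, leftover = partition_into_arrays(range(10), toy_selector)
    print([sorted(a) for a in arrays], leftover)
```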

Figures 8 and 9 give the cumulative coverage percentage for the first 50 arrays of all 10-mers, respectively
all 13-tokens, on the set of pools extracted from human chromosome 1. In these experiments
we used redundancy between 1 and 5, and primer length 14 or 20. While the MDPSP size in the first few
iterations of our MPPP algorithm is comparable to those reported for randomly generated datasets in
Section 4.1, the number of SNPs assayed per array decreases steadily with array number, as we need
to assay more and more “difficult” SNPs. Somewhat surprisingly, the results also suggest using primers of
different lengths in different SBE/SBH experiments: while a primer length of 14 seems to be optimal for
the first few arrays, longer primers improve the degree of multiplexing when only hard-to-differentiate
SNPs remain, especially for high redundancy.
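
Cumulative coverage, and the number of arrays needed to reach a target coverage, follow directly from the per-array counts. The sketch below uses made-up counts purely for illustration; the function names are not from the paper.

```python
from itertools import accumulate


def cumulative_coverage(per_array_counts, total):
    """Cumulative percentage of pools covered after each successive array."""
    return [100.0 * covered / total for covered in accumulate(per_array_counts)]


def arrays_needed(per_array_counts, total, target_percent):
    """Smallest number of arrays whose cumulative coverage reaches the target.

    Returns None if the target is never reached.
    """
    for index, covered in enumerate(accumulate(per_array_counts), start=1):
        # Integer comparison avoids floating-point rounding at the threshold.
        if 100 * covered >= target_percent * total:
            return index
    return None


if __name__ == "__main__":
    counts = [400, 300, 150, 100, 50]  # hypothetical pools assayed per array
    print(cumulative_coverage(counts, 1000))
    print(arrays_needed(counts, 1000, 90), arrays_needed(counts, 1000, 95))
```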

Finally, in Table 5 we give the number of arrays (containing either all 10-mers or all 13-tokens)
required to cover 90%, respectively 95%, of the extracted reference SNPs when using primers of length
20. In practical association studies a much lower SNP coverage (and hence far fewer arrays) would be
required due to the high degree of linkage disequilibrium between the SNPs in the human population
[24].

Table 1. Size of the strongly r-decodable pool subset computed by the three MDPSP algorithms for primer
length 20 and set of possible extensions {A,C,T,G}, with redundancy r ∈ {1,2,5} and all k-mer probe sets
for k ∈ {8,9,10} (averages over 10 test cases).

                                 k=8                   k=9                   k=10
r  # pools Algorithm      1 primer  2 primers   1 primer  2 primers   1 primer  2 primers
1  1000    Sequential         1000       1000       1000       1000       1000       1000
1  1000    MinPrimer          1000       1000       1000       1000       1000       1000
1  1000    MinProbe           1000       1000       1000       1000       1000       1000
1  2000    Sequential         2000       2000       2000       2000       2000       2000
1  2000    MinPrimer          2000       2000       2000       2000       2000       2000
1  2000    MinProbe           2000       2000       2000       2000       2000       2000
1  10000   Sequential         7740       8574       9991      10000      10000      10000
1  10000   MinPrimer          7714       8319       9991       9999      10000      10000
1  10000   MinProbe           7768       8803       9991      10000      10000      10000
1  20000   Sequential         9967      11071      19436      19948      19999      20000
1  20000   MinPrimer          9889      10999      19447      19745      19999      20000
1  20000   MinProbe           9886      11107      19458      19989      19999      20000
1  100000  Sequential        12486        [?]      43279      47688      93632      98630
1  100000  MinPrimer         13864        [?]      42980      48021      93642      96712
1  100000  MinProbe          13993      15672      43273      48418      93837      99601
1  200000  Sequential        12635      12658      49062      51646     140820     157908
1  200000  MinPrimer         15476      17010      50347      56017     139787     154028
1  200000  MinProbe          15822      17630      50459      56676     141614     160532
2  1000    Sequential         1000       1000       1000       1000       1000       1000
2  1000    MinPrimer          1000       1000       1000       1000       1000       1000
2  1000    MinProbe           1000       1000       1000       1000       1000       1000
2  2000    Sequential         1997       2000       2000       2000       2000       2000
2  2000    MinPrimer          1997       2000       2000       2000       2000       2000
2  2000    MinProbe           1997       2000       2000       2000       2000       2000
2  10000   Sequential         6210        [?]       9934       9999      10000      10000
2  10000   MinPrimer          6002        [?]       9932       9977      10000      10000
2  10000   MinProbe           6174       6890       9938       9998      10000      10000
2  20000   Sequential         7463       8192      17948      19274      19992      20000
2  20000   MinPrimer          7052       7662      17812      18455      19992      20000
2  20000   MinProbe           7435       8068      18004      19288      19993      20000
2  100000  Sequential         9254       9644      31845      34855      82315      90627
2  100000  MinPrimer          8917       9605      30043      32700      81056      85852
2  100000  MinProbe           9404      10273      31805      34481      82522      90935
2  200000  Sequential         9674       9953        [?]        [?]     109450     122470
2  200000  MinPrimer          9658      10333        [?]        [?]     104891     114624
2  200000  MinProbe          10326      11246      35228      38498     109252     122986
5  1000    Sequential          995       1000       1000       1000       1000       1000
5  1000    MinPrimer           995        [?]       1000       1000       1000       1000
5  1000    MinProbe            995       1000       1000       1000       1000       1000
5  2000    Sequential         1872       1973        [?]       2000       2000       2000
5  2000    MinPrimer          1860       1898        [?]       2000       2000       2000
5  2000    MinProbe           1866       1946        [?]       2000       2000       2000
5  10000   Sequential         3745       4161       8674       9483       9972      10000
5  10000   MinPrimer          3376       3635       8484       8881       9969       9998
5  10000   MinProbe           3480       3845       8564       9233       9970      10000
5  20000   Sequential         4289       4705      12204      13750      19498      19967
5  20000   MinPrimer          3748       4029      11393      12360      19435      19804
5  20000   MinProbe           3943       4286      11680      12960      19468      19931
5  100000  Sequential         5241       5520      17920      19612      52078      59021
5  100000  MinPrimer          4450       4726      15580      16781      47922      52711
5  100000  MinProbe           4818       5171      16521      17990      49329      55573
5  200000  Sequential         5534       5775      19767      21251        [?]      70334
5  200000  MinPrimer          4724       4990      16959      18116        [?]      61406
5  200000  MinProbe           5177       5531      18175      19757      58565      65344

Table 2. Size of the strongly r-decodable pool subset computed by the three MDPSP algorithms for primer
length 20 and set of possible extensions {A,C,T,G}, with redundancy r ∈ {1, 2, 5} and all c-token probe sets for
c ∈ {11,12,13} (averages over 10 test cases).

                                 c=11                  c=12                  c=13
r  # pools Algorithm      1 primer  2 primers   1 primer  2 primers   1 primer  2 primers
1  1000    Sequential          991       1000        [?]        [?]       1000       1000
1  1000    MinPrimer           992        999        [?]        [?]       1000       1000
1  1000    MinProbe            993       1000        999        [?]       1000       1000
1  2000    Sequential         1881       1982       1986       2000       1999       2000
1  2000    MinPrimer          1890       1959       1987       1998       1999       2000
1  2000    MinProbe           1906       1994       1988       2000       1999       2000
1  10000   Sequential         5745       6993       8006       9218       9420       9927
1  10000   MinPrimer          5556       6401       8005       8782       9472       9801
1  10000   MinProbe           6385       7972       8436       9688       9550       9980
1  20000   Sequential         7968        [?]      12458      15191      16656      18931
1  20000   MinPrimer          7490       8798      12242      14080      16673      18204
1  20000   MinProbe           9190      11548      13684      17094      17430      19613
1  100000  Sequential        13708      16042      26407      32202      45064      56064
1  100000  MinPrimer         12564      14736      24482      29336      42824      51540
1  100000  MinProbe          16820      20277      31414      39202      51448      65877
1  200000  Sequential        16241      18516      33278      39552      61351      76037
1  200000  MinPrimer         14967      17278      30762      36618      57530      70048
1  200000  MinProbe          20574      24329      40580      49300      72230      91488
2  1000    Sequential          965        998        997       1000       1000       1000
2  1000    MinPrimer           965        986        997        999       1000       1000
2  1000    MinProbe            972        998        997       1000       1000       1000
2  2000    Sequential          [?]       1905       1940       1995       1995       2000
2  2000    MinPrimer          1697       1815       1942       1981       1995       2000
2  2000    MinProbe           1766       1948       1951       1997       1996       2000
2  10000   Sequential         4216       5107        [?]        [?]       8616       9611
2  10000   MinPrimer          3926       4571        [?]        [?]       8572       9214
2  10000   MinProbe           4876       6059       7138       8610       8896       9783
2  20000   Sequential         5482       6589       9450      11615      14060      16839
2  20000   MinPrimer          5024       5901       8919      10551      13699      15613
2  20000   MinProbe           6635       8151      10796      13540      15152      17980
2  100000  Sequential         8587       9839      17469      20811      32223      39839
2  100000  MinPrimer          7897       9071      16133      19192      30138      36595
2  100000  MinProbe          10990      12695      21738      26341      38246      48131
2  200000  Sequential         9899      11114      21192      24696      41783      50811
2  200000  MinPrimer          9149      10418      19730      23155      39125      47357
2  200000  MinProbe          12782      14541      26957      31714      51198      63112
5  1000    Sequential          787        906        947        992        992       1000
5  1000    MinPrimer           767        837        941        971        992        999
5  1000    MinProbe            794        905        947        990        992       1000
5  2000    Sequential         1187       1433       1646       1870       1914       1991
5  2000    MinPrimer          1112       1284       1600       1753       1903       1960
5  2000    MinProbe           1204       1437       1652       1856       1914       1986
5  10000   Sequential         2262       2713        [?]        [?]       6284       7662
5  10000   MinPrimer          2067       2467        [?]        [?]       5939       6976
5  10000   MinProbe           2363       2875       4154       5118       6324       7651
5  20000   Sequential         2779        [?]       5347       6540       9139      11399
5  20000   MinPrimer          2553        [?]       4908       5956       8504      10308
5  20000   MinProbe           2957       3562       5520       6808       9222      11530
5  100000  Sequential         4020        [?]       8753      10211      17580      21359
5  100000  MinPrimer          3738        [?]       8122       9494      16252      19645
5  100000  MinProbe           4509       5208       9284      11078      18048      22119
5  200000  Sequential         4538       5035      10286      11738      21762      25859
5  200000  MinPrimer          4264       4749       9609      11054      20226      24058
5  200000  MinProbe           5221       5926      11149      12986      22602      27186

Table 3. Size of the strongly r-decodable pool subset computed by the three MDPSP algorithms for primer
length 20 and 2 primers per pool, with number of possible extensions |E_p| ∈ {2, 4}, redundancy r ∈ {1, 2, 5} and
all k-mer probe sets for k ∈ {8, 9, 10} (averages over 10 test cases).

                                 k=8                   k=9                   k=10
r  # SNPs  Algorithm        |Ep|=4     |Ep|=2     |Ep|=4     |Ep|=2     |Ep|=4     |Ep|=2
1  1000    Sequential         1000       1000       1000       1000       1000       1000
1  1000    MinPrimer          1000       1000       1000       1000       1000       1000
1  1000    MinProbe           1000       1000       1000       1000       1000       1000
1  2000    Sequential         2000       2000       2000       2000       2000       2000
1  2000    MinPrimer          2000       2000       2000       2000       2000       2000
1  2000    MinProbe           2000       2000       2000       2000       2000       2000
1  10000   Sequential         8574       8950      10000      10000      10000      10000
1  10000   MinPrimer          8319       8752       9999      10000      10000      10000
1  10000   MinProbe           8803       9358      10000      10000      10000      10000
1  20000   Sequential        11071        [?]        [?]        [?]      20000      20000
1  20000   MinPrimer           [?]        [?]        [?]        [?]      20000      20000
1  20000   MinProbe          11107        [?]      19989      19998      20000      20000
1  100000  Sequential        12656      13813      47688      50643      98630      99478
1  100000  MinPrimer         15324      16551      48021      52263      96712      98209
1  100000  MinProbe          15672      16800      48418      52712      99601      99885
1  200000  Sequential        12658      13890      51646      55694     157908     166796
1  200000  MinPrimer         17010      18216      56017      60962     154028     164696
1  200000  MinProbe          17630      18783      56676      61488     160532     173910
2  1000    Sequential         1000       1000       1000       1000       1000       1000
2  1000    MinPrimer          1000       1000       1000       1000       1000       1000
2  1000    MinProbe           1000       1000       1000       1000       1000       1000
2  2000    Sequential         2000       2000       2000       2000       2000       2000
2  2000    MinPrimer          2000       2000       2000       2000       2000       2000
2  2000    MinProbe           2000       2000       2000       2000       2000       2000
2  10000   Sequential         6901        [?]        [?]      10000      10000      10000
2  10000   MinPrimer          6463        [?]        [?]        [?]      10000      10000
2  10000   MinProbe           6890       7443       9998        [?]      10000      10000
2  20000   Sequential         8192       8639      19274      19670      20000      20000
2  20000   MinPrimer          7662       8348      18455      18988      20000      20000
2  20000   MinProbe           8068       8808      19288      19661      20000      20000
2  100000  Sequential         9644      10175      34855      36886      90627      94420
2  100000  MinPrimer          9605      10398      32700      35771      85852      90098
2  100000  MinProbe          10273      11093      34481      37743      90935      94868
2  200000  Sequential         9953      10535      37891      40060     122470     130911
2  200000  MinPrimer         10333      11143      36247      39619     114624     125287
2  200000  MinProbe          11246      12068      38498      41857     122986     134342
5  1000    Sequential         1000       1000        [?]       1000       1000       1000
5  1000    MinPrimer           999       1000        [?]       1000       1000       1000
5  1000    MinProbe           1000       1000        [?]       1000       1000       1000
5  2000    Sequential         1973       1989       2000       2000       2000       2000
5  2000    MinPrimer          1898       1933       2000       2000       2000       2000
5  2000    MinProbe           1946       1975       2000       2000       2000       2000
5  10000   Sequential         4161       4405       9483       9722      10000      10000
5  10000   MinPrimer          3635       3970       8881       9211       9998       9999
5  10000   MinProbe           3845       4204       9233       9546      10000      10000
5  20000   Sequential         4705       4924        [?]      14739      19967      19985
5  20000   MinPrimer          4029       4391      12360      13378      19804      19905
5  20000   MinProbe           4286       4690      12960      14110      19931      19973
5  100000  Sequential         5520       5727      19612      20634      59021      63631
5  100000  MinPrimer          4726       5114      16781      18352      52711      57521
5  100000  MinProbe           5171       5581      17990      19741      55573      61043
5  200000  Sequential         5775        [?]      21251      22193      70334      75361
5  200000  MinPrimer          4990        [?]      18116      19732      61406      67565
5  200000  MinProbe           5531       5939      19757      21555      65344      72313

Table 4. Size of the strongly r-decodable pool subset computed by the three MDPSP algorithms for primer
length 20 and 2 primers per pool, with number of possible extensions |E_p| ∈ {2, 4}, redundancy r ∈ {1, 2, 5} and
all c-token probe sets for c ∈ {11,12,13} (averages over 10 test cases).

                                 c=11                  c=12                  c=13
r  # SNPs  Algorithm        |Ep|=4     |Ep|=2     |Ep|=4     |Ep|=2     |Ep|=4     |Ep|=2
1  1000    Sequential         1000       1000       1000       1000       1000       1000
1  1000    MinPrimer           999        999       1000       1000       1000       1000
1  1000    MinProbe           1000       1000       1000       1000       1000       1000
1  2000    Sequential         1982       1990       2000       2000       2000       2000
1  2000    MinPrimer          1959       1968       1998       1998       2000       2000
1  2000    MinProbe           1994       1998       2000       2000       2000       2000
1  10000   Sequential         6993       7324       9218       9412       9927       9953
1  10000   MinPrimer          6401       6776       8782       9034       9801       9866
1  10000   MinProbe           7972       8280       9688       9782       9980       9990
1  20000   Sequential         9733      10358      15191      15843      18931      19197
1  20000   MinPrimer          8798       9489      14080      14797      18204      18573
1  20000   MinProbe          11548      12187      17094      17599      19613      19746
1  100000  Sequential        16042      17216      32202      34459      56064      59498
1  100000  MinPrimer         14736      15817      29336      31608      51540      55031
1  100000  MinProbe          20277      21599      39202      41665      65877      69188
1  200000  Sequential        18516      19789      39552      42556      76037      81443
1  200000  MinPrimer         17278      18483      36618      39500      70048      75470
1  200000  MinProbe          24329      25757      49300      52534      91488      97154
2  1000    Sequential          998        998       1000       1000       1000       1000
2  1000    MinPrimer           986        990        999       1000       1000       1000
2  1000    MinProbe            998        999       1000       1000       1000       1000
2  2000    Sequential         1905       1931       1995       1998       2000       2000
2  2000    MinPrimer          1815       1852       1981       1986       2000       2000
2  2000    MinProbe           1948       1962       1997       1999       2000       2000
2  10000   Sequential         5107       5431       7891       8231       9611       9716
2  10000   MinPrimer          4571       4924       7252       7621       9214       9381
2  10000   MinProbe           6059       6372       8610       8833       9783       9851
2  20000   Sequential         6589       7036      11615      12312      16839      17409
2  20000   MinPrimer          5901       6388      10551      11255      15613      16231
2  20000   MinProbe           8151       8674      13540      14184      17980      18396
2  100000  Sequential         9839      10552      20811      22486      39839      42814
2  100000  MinPrimer          9071       9819      19192      20864      36595      39542
2  100000  MinProbe          12695      13562      26341      28190      48131      51125
2  200000  Sequential        11114      11894      24696      26659      50811      54858
2  200000  MinPrimer         10418      11212      23155      25122      47357      51390
2  200000  MinProbe          14541      15467      31714      34015      63112      67567
5  1000    Sequential          906        932        992        996       1000       1000
5  1000    MinPrimer           837        868        971        981        999        999
5  1000    MinProbe            905        928        990        994       1000       1000
5  2000    Sequential         1433       1497       1870       1896       1991       1995
5  2000    MinPrimer          1284       1350       1753       1800       1960       1974
5  2000    MinProbe           1437       1511       1856       1885       1986       1990
5  10000   Sequential         2713       2944       4988       5343       7662       8000
5  10000   MinPrimer          2467       2668       4495       4825       6976       7324
5  10000   MinProbe           2875       3081       5118       5436       7651       7988
5  20000   Sequential         3279       3552       6540       7040      11399      12143
5  20000   MinPrimer          2998       3273       5956       6424      10308      11007
5  20000   MinProbe           3562       3817       6808       7314      11530      12240
5  100000  Sequential         4536       4912      10211      11140        [?]        [?]
5  100000  MinPrimer          4250       4610       9494      10352        [?]        [?]
5  100000  MinProbe           5208       5602      11078      11932        [?]      23977
5  200000  Sequential         5035       5443      11738      12809      25859      28234
5  200000  MinPrimer          4749       5128      11054      12022      24058      26297
5  200000  MinProbe           5926       6363      12986      13987      27186      29439

[Plots (a) and (b): primer length (10 to 30) on the horizontal axis; one curve per algorithm (MinProbe,
Sequential, MinPrimer) for each combination of r ∈ {2, 5} and n ∈ {100k, 200k}.]

Fig. 7. Size of the strongly r-decodable pool subset computed by the three MDPSP algorithms as a function of
primer length, for pools with 2 primers, 2 possible extensions per primer, and array probes consisting of all
10-mers (a), respectively all 645,376 13-tokens (b) (averages over 10 test cases).

































































































































































Table 5. Number of arrays needed to cover 90% and 95% of the reference SNPs that have unambiguous primers of
length 20.

                                               10-mer arrays                 13-token arrays
                                             r=1       r=2       r=5       r=1       r=2       r=5
Chr ID  # Ref. SNPs  # Extracted pools  90%  95%  90%  95%  90%  95%  90%  95%  90%  95%  90%  95%
1            786058             736850    5    7    8   11   15   24   10   14   17   23   39   56
2            758368             704415    5    6    7    9   14   18    9   12   14   18   32   42
3            647918             587531    5    6    7    8   13   16    8   10   12   15   26   35
4            690063             646534    5    6    7    9   14   17    8   10   12   15   26   34
5            590891             550794    5    6    6    8   12   16    7   10   12   15   26   34
6            791255             742894   10   20   14   29   30   54   15   29   23   38   49   73
7            666932             629089    6    9    8   12   16   25   10   15   16   22   36   48
8            488654             456856    4    5    5    7   10   12    7    8   10   13   22   29
9            465325             441627    4    6    6    8   11   17    7   10   11   16   26   36
10           512165             480614    4    6    6    8   11   16    8   10   12   16   27   38
11           505641             476379    4    6    6    8   11   15    8   10   12   15   26   35
12           474310             443988    4    6    6    8   11   18    7   10   11   15   25   36
13           371187             347921    3    4    5    6    9   11    5    7    8   10   16   22
14           292173             271130    3    4    4    5    7   10    5    7    8   10   16   23
15           277543             258094    3    4    4    5    7   11    5    7    8   10   17   24
16           306530             288652    4    6    5    9    9   18    7   10   11   15   25   35
17           269887             249563    3    5    4    8    9   18    7   10   11   15   25   37
18           268582             250594    3    3    4    5    7    9    4    6    6    8   14   18
19           212057             199221    4    6    5    9   11   21    8   11   12   17   29   43
20           292248             262567    3    4    4    5    7   11    6    8    9   12   20   27
21           148798             138825    2    3    3    3    5    6    3    4    5    6   10   13
22           175939             164632    3    4    3    6    6   13    6    8    9   12   21   29
X            380246             362778    4    6    6    8   10   15    6    9    9   13   19   26
Y             50725              49372    2    2    2    2    3    3    2    2    2    3    4    5




























